
Run the cudf-polars test suite against DaskEngine and RayEngine #22381

Merged
rapids-bot[bot] merged 28 commits into rapidsai:main from madsbk:engine_reset-test-dask-and-ray
May 7, 2026

Conversation

@madsbk
Member

@madsbk madsbk commented May 5, 2026

Builds on the cached streaming_engines fixture from #22364, which amortizes SPMD bootstrap via _reset(), and extends the same pattern to Dask and Ray.

With this change, the test matrix runs against:

`["in-memory", "spmd", "spmd-small", "dask", "ray"]`

subject to package availability and `rrun` gating.

We might change the different setups later, but for now CI runs:

| Engine       | Block Size(s)         | GPU Configuration |
|--------------|-----------------------|-------------------|
| `SPMDEngine` | `"medium"`, `"small"` | Single GPU        |
| `DaskEngine` | `"medium"`            | Single GPU        |
| `RayEngine`  | `"medium"`            | Two GPUs          |
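
The cached-engine pattern the description refers to can be sketched as a plain cache-plus-reset factory. This is an illustrative toy, not the actual fixture code: `FakeEngine`, `get_engine`, and the counters are assumed names invented here to show the amortization idea.

```python
# Sketch of the amortized-bootstrap pattern (hypothetical names, not the
# real cudf-polars fixture): engines are expensive to bootstrap, so each
# one is built once and then cheaply reset between tests via _reset().

_ENGINE_CACHE: dict = {}

class FakeEngine:
    """Stand-in for an expensive multi-rank engine (SPMD/Dask/Ray)."""

    def __init__(self, name: str):
        self.name = name
        self.bootstraps = 1  # the full bootstrap happens exactly once
        self.resets = 0

    def _reset(self) -> None:
        # Cheap per-test state reset instead of tearing down and rebuilding.
        self.resets += 1

def get_engine(name: str) -> FakeEngine:
    if name not in _ENGINE_CACHE:
        _ENGINE_CACHE[name] = FakeEngine(name)  # expensive, done once
    engine = _ENGINE_CACHE[name]
    engine._reset()  # amortize: reset, don't rebuild
    return engine
```

A pytest fixture parametrized over `["in-memory", "spmd", "spmd-small", "dask", "ray"]` would then call such a factory once per test, paying the bootstrap cost only on first use of each engine.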

@madsbk madsbk self-assigned this May 5, 2026
@madsbk madsbk added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 5, 2026
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels May 5, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 5, 2026
@madsbk madsbk force-pushed the engine_reset-test-dask-and-ray branch 3 times, most recently from 7bb501d to 4c5b5da Compare May 6, 2026 11:06
@madsbk madsbk force-pushed the engine_reset-test-dask-and-ray branch from d088a13 to 3fccdf3 Compare May 6, 2026 13:33
@madsbk madsbk force-pushed the engine_reset-test-dask-and-ray branch from 3fccdf3 to b294bf8 Compare May 6, 2026 14:05
Comment thread .github/workflows/pr.yaml
```yaml
# (rapidsmpf compatibility already validated in rapidsmpf CI)
matrix_filter: map(select(.ARCH == "amd64")) | group_by(.CUDA_VER|split(".")|map(tonumber)|.[0]) | map(max_by([(.PY_VER|split(".")|map(tonumber)), (.CUDA_VER|split(".")|map(tonumber))]))
build_type: pull-request
container-options: "--cap-add CAP_SYS_PTRACE --shm-size=8g --ulimit=nofile=1000000:1000000"
```
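
The `matrix_filter` above is a jq program: keep amd64 entries, group them by CUDA major version, and within each group take the entry with the highest `(PY_VER, CUDA_VER)`. A rough Python equivalent of that selection logic (illustrative only; `filter_matrix` is a name invented here) is:

```python
# Python sketch of the jq matrix_filter's selection logic. Note that jq's
# group_by sorts by the grouping key, so we pre-sort before itertools.groupby.
from itertools import groupby

def _v(s: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(p) for p in s.split("."))

def filter_matrix(entries: list) -> list:
    amd64 = [e for e in entries if e["ARCH"] == "amd64"]
    amd64.sort(key=lambda e: _v(e["CUDA_VER"])[0])  # group key: CUDA major
    return [
        max(group, key=lambda e: (_v(e["PY_VER"]), _v(e["CUDA_VER"])))
        for _, group in groupby(amd64, key=lambda e: _v(e["CUDA_VER"])[0])
    ]
```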
Member


These changes are necessary to provide enough resources for UCX. For reference, we already do the same for UCXX and RapidsMPF; both need it for the same reason.

@madsbk madsbk marked this pull request as ready for review May 6, 2026 18:56
@madsbk madsbk requested a review from a team as a code owner May 6, 2026 18:56
madsbk added 7 commits May 7, 2026 07:52
The ConditionalJoin lowering only triggered the multi-partition fallback
when the smaller side had more than one partition, missing asymmetric
cases such as (left=2, right=1).

In those cases no Repartition was inserted, so on multi-rank engines the
single right partition only existed on one rank. Peer ranks executed the
conditional join against an empty right side and silently dropped rows.

For example, with num_ranks=2 a join_where producing 90 rows on CPU
returned only 54 rows on RayEngine, missing the 36 rows from rank 1.

Fix this by gating the fallback on output_count > 1, i.e.
max(left_count, right_count) > 1. This ensures the smaller side is
repartitioned into a single broadcastable partition whenever either
input is multi-partition. The single-partition equi-case remains
unchanged.

Surfaced while enabling DaskEngine and RayEngine in the cudf-polars test
fixture matrix. The failing cases:

  test_join_conditional[{dask,ray}-9-{True,False}]

reported:

  DataFrames are different (height mismatch)
  [left]: 90
  [right]: 54

With this fix all 16 cases pass (4 engines × 2 reverse × 2
max_rows_per_partition), and SPMD now correctly emits the fallback
warning for max_rows_per_partition=9.
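In partition-count terms, the fixed gating can be sketched as a self-contained toy (this is not the actual lowering code; `plan_conditional_join` and its return shape are invented here for illustration):

```python
def plan_conditional_join(left_count: int, right_count: int,
                          dynamic_planning: bool = True) -> tuple:
    """Return (left_count, right_count, fell_back) after the fallback gating.

    The fix gates on output_count = max(left_count, right_count) > 1, so the
    smaller side collapses to one broadcastable partition whenever either
    input is multi-partition, including asymmetric cases like (2, 1).
    """
    output_count = max(left_count, right_count)
    if output_count > 1 or dynamic_planning:
        if left_count < right_count:
            return 1, right_count, True    # repartition left to count=1
        return left_count, 1, True         # repartition right to count=1
    return left_count, right_count, False  # single-partition case: unchanged
```

The old gate checked the smaller side's own count (`left_count > 1`), which is why `(left=2, right=1)` skipped the fallback entirely and peer ranks joined against an empty right side.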
@madsbk madsbk requested review from TomAugspurger and pentschev May 7, 2026 08:07
```diff
 fallback_msg = "ConditionalJoin not supported for multiple partitions."
 if left_count < right_count:
-    if left_count > 1 or dynamic_planning:
+    if output_count > 1 or dynamic_planning:
```
Member Author


Fixed a bug in the conditional-join fallback logic. We need to repartition the smaller table to a single broadcastable partition whenever either side has more than one partition.

Previously the fallback only triggered when the smaller side itself had multiple partitions, which breaks asymmetric cases like (left=2, right=1). In that situation the single right partition only exists on one rank, so peer ranks execute the conditional join against an empty right side and silently drop rows.

@TomAugspurger and @rjzamora, does that sound correct?

Contributor


I'll defer to Rick on this...

Though if the new way is correct, then maybe the logic would be a bit more obvious if we combine the first two if conditions?

```python
if (left_count < right_count) and (output_count > 1 or dynamic_planning):
    left = Repartition(left.schema, left)
    pi_left[left] = PartitionInfo(count=1)
    _fallback_inform(fallback_msg, config_options)
elif output_count > 1 or dynamic_planning:
    right = Repartition(right.schema, right)
    pi_right[right] = PartitionInfo(count=1)
    _fallback_inform(fallback_msg, config_options)
```

Then the logic might be a bit clearer: `left_count < right_count` only decides which side gets repartitioned, while the repeated `output_count > 1 or dynamic_planning` condition stands out as the shared gate.

At least I hope those are logically equivalent.

Member


I guess we are maybe disabling dynamic planning for some join tests? By default, dynamic-planning is always on, so we always repartition the smaller side.

Member Author


> I guess we are maybe disabling dynamic planning for some join tests? By default, dynamic-planning is always on, so we always repartition the smaller side.

Maybe; it only happens when running with multiple GPUs.

Member Author


> Though if the new way is correct, then maybe the logic would be a bit more obvious if we combine the first two if conditions?

Yes, and we can even simplify a bit more. Done

Contributor

@bdice bdice left a comment


Approving CI and packaging!

@TomAugspurger
Contributor

It looks like the wheel-tests-cudf-polars job is now taking >2 hours, which feels excessive: https://github.com/rapidsai/cudf/actions/runs/25492512677/job/74812354023?pr=22381.

Part of that is because we're running these tests against multiple versions of polars. I think we could get away with running the full suite of tests (all tests, all engines) against only the latest version of polars.

But I also wonder whether we're still unexpectedly creating expensive resources like Ray actors in too many places.

Member

@pentschev pentschev left a comment


Left a couple of comments, but I won't block this PR. I don't think it's wise for us to hardcode test details to fit CI; rather, we should make them configurable and make CI fit that instead.

Comment thread python/cudf_polars/tests/conftest.py
Comment thread python/cudf_polars/tests/conftest.py
@madsbk
Member Author

madsbk commented May 7, 2026

> It looks like the wheel-tests-cudf-polars job is now taking >2 hours, which feels excessive: https://github.com/rapidsai/cudf/actions/runs/25492512677/job/74812354023?pr=22381.
>
> Part of that is because we're running these tests against multiple versions of polars. I think we could get away with running the full suite of tests (all tests, all engines) against only the latest version of polars.
>
> But I also wonder whether we're still unexpectedly creating expensive resources like Ray actors in too many places.

I think this was just a slower runner; note that its sibling job only took 16 minutes. In fact, I don't think wheel-tests-cudf-polars runs with Ray at all; I only added Ray to the wheel-tests-cudf-polars-with-rapidsmpf runs.

@coderabbitai

coderabbitai Bot commented May 7, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between c9ad1c5 and 83059fd.

📒 Files selected for processing (22)
  • .github/workflows/pr.yaml
  • .github/workflows/test.yaml
  • ci/run_cudf_polars_experimental_pytests.sh
  • ci/test_cudf_polars_experimental.sh
  • dependencies.yaml
  • python/cudf_polars/cudf_polars/experimental/join.py
  • python/cudf_polars/cudf_polars/testing/engine_utils.py
  • python/cudf_polars/pyproject.toml
  • python/cudf_polars/tests/conftest.py
  • python/cudf_polars/tests/experimental/test_all_gather_host_data.py
  • python/cudf_polars/tests/experimental/test_dataframescan.py
  • python/cudf_polars/tests/experimental/test_filter.py
  • python/cudf_polars/tests/experimental/test_groupby.py
  • python/cudf_polars/tests/experimental/test_io_multirank.py
  • python/cudf_polars/tests/experimental/test_join.py
  • python/cudf_polars/tests/experimental/test_metadata.py
  • python/cudf_polars/tests/experimental/test_parallel.py
  • python/cudf_polars/tests/experimental/test_rolling.py
  • python/cudf_polars/tests/experimental/test_select.py
  • python/cudf_polars/tests/experimental/test_spilling.py
  • python/cudf_polars/tests/experimental/test_statistics.py
  • python/cudf_polars/tests/experimental/test_unique.py
💤 Files with no reviewable changes (1)
  • python/cudf_polars/tests/experimental/test_all_gather_host_data.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added Ray as an optional dependency for distributed computing support
  • Bug Fixes

    • Improved conditional join partition handling for multi-rank scenarios
  • Tests

    • Enhanced multi-rank distributed test infrastructure with improved engine handling and warning detection
    • Extended test fixtures for SPMD-only and streaming engine variants
  • Chores

    • Updated CI container configuration with increased shared memory and file limits for test stability
    • Simplified test suite logging messages

Walkthrough

This PR extends cudf-polars' test infrastructure to support multi-rank Ray streaming execution alongside SPMD. Ray is added as an optional dependency, container runtime options are configured for CI, and tests are refactored to use conditional fixtures and warning helpers that adapt behavior based on engine type and rank configuration.

Changes

Ray Dependency & Infrastructure Setup

| Layer / File(s) | Summary |
|---|---|
| Ray Dependency Declarations<br>`dependencies.yaml`, `python/cudf_polars/pyproject.toml`, `ci/test_cudf_polars_experimental.sh` | Ray ≥2.55.1 added as an optional dependency; new dependency group `py_run_cudf_polars_ray` created in the manifest. |
| CI Container & Script Configuration<br>`.github/workflows/pr.yaml`, `.github/workflows/test.yaml`, `ci/run_cudf_polars_experimental_pytests.sh` | Container options configure CAP_SYS_PTRACE, 8 GB shared memory, and a 1M file-descriptor limit for the wheel-tests-cudf-polars-with-rapidsmpf job; test message simplified. |
| Test Helper Functions<br>`python/cudf_polars/cudf_polars/testing/engine_utils.py` | New `warns_on_spmd(engine, *args, when=True, **kwargs)` helper conditionally applies `pytest.warns` for SPMD engines only; streaming-engine fixture params gated behind a `rapidsmpf.bootstrap.is_running_with_rrun()` check; `allow_gpu_sharing=True` added to baselines. |
| Test Fixtures & Markers<br>`python/cudf_polars/tests/conftest.py` | `NUM_RANKS = 2` constant added; RayEngine registration with pinned ranks and GPU-sharing options; new `spmd_engine_factory` fixture for SPMD-only creation; `skip_on_streaming_engine` marker extended with optional `engine=` keyword filtering. |
| Core Logic Update<br>`python/cudf_polars/cudf_polars/experimental/join.py` | ConditionalJoin repartitioning logic updated to repartition to a single partition when either side has >1 partition or dynamic planning is enabled. |

Test Suite Refactoring for Multi-Rank Support

| Layer / File(s) | Summary |
|---|---|
| Factory Fixture Migration<br>`python/cudf_polars/tests/experimental/test_select.py`, `test_parallel.py`, `test_metadata.py`, `test_spilling.py`, `test_groupby.py` | Tests requiring SPMD-specific execution migrate from `streaming_engine_factory` to `spmd_engine_factory` to ensure deterministic single-rank setup. |
| Simplified Multi-Backend Fixtures<br>`python/cudf_polars/tests/experimental/test_io_multirank.py`, `test_statistics.py` | Test fixtures consolidate from explicit backend branching (`["spmd", "ray", "dask"]` params) to single `streaming_engine_factory`-created fixtures, removing environment checks and backend-specific imports. |
| Warning Assertion Updates<br>`python/cudf_polars/tests/experimental/test_filter.py`, `test_join.py`, `test_select.py`, `test_unique.py` | `pytest.warns` replaced with `warns_on_spmd(...)` to conditionally assert warnings only for SPMD engines; warning visibility on Ray/Dask is not enforced in tests. |
| Conditional Runtime Markers<br>`python/cudf_polars/tests/experimental/test_dataframescan.py`, `test_all_gather_host_data.py`, `test_rolling.py`, `test_groupby.py` | xfail/skip markers applied conditionally at runtime based on engine type or rank count; e.g. `test_dataframescan_concat` xfails only when `nranks > 1`; UUID uniqueness assertion removed from the cluster-info test. |
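
A minimal sketch of how such a conditional warning context could work, assuming the engine is identified by a name string — this is not the actual `engine_utils` implementation, and `_AssertWarns` is a hypothetical stand-in for `pytest.warns`:

```python
# Sketch (assumed names): enforce a warning only for SPMD engines, since on
# Ray/Dask the warning may be emitted on a worker rather than in the test
# process, where it would be invisible to pytest.warns.
import contextlib
import warnings

class _AssertWarns:
    """Context manager asserting a warning of `category` was emitted."""

    def __init__(self, category):
        self.category = category
        self._cw = warnings.catch_warnings(record=True)

    def __enter__(self):
        self.records = self._cw.__enter__()
        warnings.simplefilter("always")
        return self.records

    def __exit__(self, exc_type, exc, tb):
        self._cw.__exit__(exc_type, exc, tb)
        if exc_type is None:
            assert any(issubclass(r.category, self.category)
                       for r in self.records), \
                f"expected {self.category.__name__} was not emitted"
        return False

def warns_on_spmd(engine_name: str, category=UserWarning, *, when: bool = True):
    if when and engine_name.startswith("spmd"):
        return _AssertWarns(category)
    return contextlib.nullcontext()  # Ray/Dask: warning not enforced
```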

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.90%, which is below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly describes the main objective of the pull request: running the cudf-polars test suite against DaskEngine and RayEngine, which aligns with the changeset's primary focus. |
| Description check | ✅ Passed | The description explains the extension of the cached `streaming_engines` fixture to support Dask and Ray engines and details the new test matrix configuration. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@mroeschke mroeschke left a comment


Just one comment about adding a run_constraints for ray in the conda recipe.

Comment thread dependencies.yaml
```yaml
common:
  - output_types: [conda, requirements, pyproject]
    packages:
      - ray>=2.55.1
```
Contributor


In conda/recipes/cudf-polars/recipe.yaml, we'll probably want to add a

```yaml
run_constraints:
  - ray >=2.55.1
```

to mirror this constraint

Member Author


Ray is still optional, this is just for the CI testing

Contributor

@mroeschke mroeschke May 7, 2026


Right, IIRC run-constraints is like specifying a version constraint for optional runtime dependencies (like what got added to the pyproject.toml)

https://rattler-build.prefix.dev/latest/reference/recipe_file/#run-constraints

> Packages that are optional at runtime but must obey the supplied additional constraint if they are installed.

But can be done in a follow up since the CI is currently green

Member Author


Ah, I didn't know that, thanks.
But yes, let's do it in a follow-up PR.

Contributor


Sure thing, opened #22414

@madsbk
Member Author

madsbk commented May 7, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 16c6356 into rapidsai:main May 7, 2026
377 of 384 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in cuDF Python May 7, 2026
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request May 8, 2026
…apidsai#22381)

Builds on the cached `streaming_engines` fixture from rapidsai#22364, which amortizes SPMD bootstrap via `_reset()`, and extends the same pattern to Dask and Ray.

With this change, the test matrix runs against:

`["in-memory", "spmd", "spmd-small", "dask", "ray"]`

subject to package availability and `rrun` gating.

We might change the different setups later, but for now CI runs:

| Engine        | Block Size(s)         | GPU Configuration |
|----------------|-----------------------|-------------------|
| `SPMDEngine`   | `"medium"`, `"small"` | Single GPU        |
| `DaskEngine`   | `"medium"`            | Single GPU        |
| `RayEngine`    | `"medium"`            | Two GPUs          |

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Bradley Dice (https://github.com/bdice)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#22381
@madsbk madsbk deleted the engine_reset-test-dask-and-ray branch May 9, 2026 07:32

Labels

cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: Done

8 participants